{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://i.imgur.com/eBRPvWB.png)\n",
    "\n",
    "# Practical PyTorch: Exploring Word Vectors with GloVe\n",
    "\n",
    "When working with words, dealing with the huge but sparse domain of language can be challenging. Even for a small corpus, your neural network (or any type of model) needs to support many thousands of discrete inputs and outputs.\n",
    "\n",
    "Besides the raw number words, the standard technique of representing words as one-hot vectors (e.g. \"the\" = `[0 0 0 1 0 0 0 0 ...]`) does not capture any information about relationships between words.\n",
    "\n",
    "Word vectors address this problem by representing words in a multi-dimensional vector space. This can bring the dimensionality of the problem from hundreds-of-thousands to just hundreds. Plus, the vector space is able to capture semantic relationships between words in terms of distance and vector arithmetic.\n",
    "\n",
    "![](https://i.imgur.com/y4hG1ak.png)\n",
    "\n",
    "There are a few techniques for creating word vectors. The word2vec algorithm predicts words in a context (e.g. what is the most likely word to appear in \"the cat ? the mouse\"), while GloVe vectors are based on global counts across the corpus &mdash; see [How is GloVe different from word2vec?](https://www.quora.com/How-is-GloVe-different-from-word2vec) on Quora for some better explanations.\n",
    "\n",
    "In my opinion the best feature of GloVe is that multiple sets of pre-trained vectors are easily [available for download](https://nlp.stanford.edu/projects/glove/), so that's what we'll use here.\n",
    "\n",
    "## Recommended reading\n",
    "\n",
    "* https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/\n",
    "* https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/\n",
    "* https://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing torchtext\n",
    "\n",
    "The torchtext package is not currently on the PIP or Conda package managers, but it's easy to install manually:\n",
    "\n",
    "```\n",
    "git clone https://github.com/pytorch/text pytorch-text\n",
    "cd pytorch-text\n",
    "python setup.py install\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading word vectors\n",
    "\n",
    "Torchtext includes functions to download GloVe (and other) embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torchtext.vocab as vocab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 400000 words\n"
     ]
    }
   ],
   "source": [
    "glove = vocab.GloVe(name='6B', dim=100)\n",
    "\n",
    "print('Loaded {} words'.format(len(glove.itos)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The returned `GloVe` object includes attributes:\n",
    "- `stoi` _string-to-index_ returns a dictionary of words to indexes\n",
    "- `itos` _index-to-string_ returns an array of words by index\n",
    "- `vectors` returns the actual vectors. To get a word vector get the index to get the vector:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def get_word(word):\n",
    "    return glove.vectors[glove.stoi[word]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding closest vectors\n",
    "\n",
    "Going from word &rarr; vector is easy enough, but to go from vector &rarr; word takes more work. Here I'm (naively) calculating the distance for each word in the vocabulary, and sorting based on that distance:\n",
    "\n",
    "Anyone with a suggestion for optimizing this, please let me know!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def closest(vec, n=10):\n",
    "    \"\"\"\n",
    "    Find the closest words for a given vector\n",
    "    \"\"\"\n",
    "    all_dists = [(w, torch.dist(vec, get_word(w))) for w in glove.itos]\n",
    "    return sorted(all_dists, key=lambda t: t[1])[:n]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will return a list of `(word, distance)` tuple pairs. Here's a helper function to print that list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def print_tuples(tuples):\n",
    "    for tuple in tuples:\n",
    "        print('(%.4f) %s' % (tuple[1], tuple[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now using a known word vector we can see which other vectors are closest:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0.0000) google\n",
      "(3.0772) yahoo\n",
      "(3.8836) microsoft\n",
      "(4.1048) web\n",
      "(4.1082) aol\n",
      "(4.1165) facebook\n",
      "(4.3917) ebay\n",
      "(4.4122) msn\n",
      "(4.4540) internet\n",
      "(4.4651) netscape\n"
     ]
    }
   ],
   "source": [
    "print_tuples(closest(get_word('google')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word analogies with vector arithmetic\n",
    "\n",
    "The most interesting feature of a well-trained word vector space is that certain semantic relationships (beyond just close-ness of words) can be captured with regular vector arithmetic. \n",
    "\n",
    "![](https://i.imgur.com/d0KuM5x.png)\n",
    "\n",
    "(image borrowed from [a slide from Omer Levy and Yoav Goldberg](https://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# In the form w1 : w2 :: w3 : ?\n",
    "def analogy(w1, w2, w3, n=5, filter_given=True):\n",
    "    print('\\n[%s : %s :: %s : ?]' % (w1, w2, w3))\n",
    "   \n",
    "    # w2 - w1 + w3 = w4\n",
    "    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3))\n",
    "    \n",
    "    # Optionally filter out given words\n",
    "    if filter_given:\n",
    "        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]\n",
    "        \n",
    "    print_tuples(closest_words[:n])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classic example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "[king : man :: queen : ?]\n",
      "(4.0811) woman\n",
      "(4.6916) girl\n",
      "(5.2703) she\n",
      "(5.2788) teenager\n",
      "(5.3084) boy\n"
     ]
    }
   ],
   "source": [
    "analogy('king', 'man', 'queen')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's explore the word space and see what stereotypes we can uncover:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "[man : actor :: woman : ?]\n",
      "(2.8133) actress\n",
      "(5.0039) comedian\n",
      "(5.1399) actresses\n",
      "(5.2773) starred\n",
      "(5.3085) screenwriter\n",
      "\n",
      "[cat : kitten :: dog : ?]\n",
      "(3.8146) puppy\n",
      "(4.2944) rottweiler\n",
      "(4.5888) puppies\n",
      "(4.6086) pooch\n",
      "(4.6520) pug\n",
      "\n",
      "[dog : puppy :: cat : ?]\n",
      "(3.8146) kitten\n",
      "(4.0255) puppies\n",
      "(4.1575) kittens\n",
      "(4.1882) pterodactyl\n",
      "(4.1945) scaredy\n",
      "\n",
      "[russia : moscow :: france : ?]\n",
      "(3.2697) paris\n",
      "(4.6857) french\n",
      "(4.7085) lyon\n",
      "(4.9087) strasbourg\n",
      "(5.0362) marseille\n",
      "\n",
      "[obama : president :: trump : ?]\n",
      "(6.4302) executive\n",
      "(6.5149) founder\n",
      "(6.6997) ceo\n",
      "(6.7524) hilton\n",
      "(6.7729) walt\n",
      "\n",
      "[rich : mansion :: poor : ?]\n",
      "(5.8262) residence\n",
      "(5.9444) riverside\n",
      "(6.0283) hillside\n",
      "(6.0328) abandoned\n",
      "(6.0681) bungalow\n",
      "\n",
      "[elvis : rock :: eminem : ?]\n",
      "(5.6597) rap\n",
      "(6.2057) rappers\n",
      "(6.2161) rapper\n",
      "(6.2444) punk\n",
      "(6.2690) hop\n",
      "\n",
      "[paper : newspaper :: screen : ?]\n",
      "(4.7810) tv\n",
      "(5.1049) television\n",
      "(5.3818) cinema\n",
      "(5.5524) feature\n",
      "(5.5646) shows\n",
      "\n",
      "[monet : paint :: michelangelo : ?]\n",
      "(6.0782) plaster\n",
      "(6.3768) mold\n",
      "(6.3922) tile\n",
      "(6.5819) marble\n",
      "(6.6524) image\n",
      "\n",
      "[beer : barley :: wine : ?]\n",
      "(5.6021) grape\n",
      "(5.6760) beans\n",
      "(5.8174) grapes\n",
      "(5.9035) lentils\n",
      "(5.9454) figs\n",
      "\n",
      "[earth : moon :: sun : ?]\n",
      "(6.2294) lee\n",
      "(6.4125) kang\n",
      "(6.4644) tan\n",
      "(6.4757) yang\n",
      "(6.4853) lin\n",
      "\n",
      "[house : roof :: castle : ?]\n",
      "(6.2919) stonework\n",
      "(6.3779) masonry\n",
      "(6.4773) canopy\n",
      "(6.4954) fortress\n",
      "(6.5259) battlements\n",
      "\n",
      "[building : architect :: software : ?]\n",
      "(5.8369) programmer\n",
      "(6.8881) entrepreneur\n",
      "(6.9240) inventor\n",
      "(6.9730) developer\n",
      "(6.9949) innovator\n",
      "\n",
      "[boston : bruins :: phoenix : ?]\n",
      "(3.8546) suns\n",
      "(4.1968) mavericks\n",
      "(4.6126) coyotes\n",
      "(4.6894) mavs\n",
      "(4.6971) knicks\n",
      "\n",
      "[good : heaven :: bad : ?]\n",
      "(4.3959) hell\n",
      "(5.2864) ghosts\n",
      "(5.2898) hades\n",
      "(5.3414) madness\n",
      "(5.3520) purgatory\n",
      "\n",
      "[jordan : basketball :: woods : ?]\n",
      "(5.8607) golf\n",
      "(6.4110) golfers\n",
      "(6.4418) tournament\n",
      "(6.4592) tennis\n",
      "(6.6560) collegiate\n"
     ]
    }
   ],
   "source": [
    "analogy('man', 'actor', 'woman')\n",
    "analogy('cat', 'kitten', 'dog')\n",
    "analogy('dog', 'puppy', 'cat')\n",
    "analogy('russia', 'moscow', 'france')\n",
    "analogy('obama', 'president', 'trump')\n",
    "analogy('rich', 'mansion', 'poor')\n",
    "analogy('elvis', 'rock', 'eminem')\n",
    "analogy('paper', 'newspaper', 'screen')\n",
    "analogy('monet', 'paint', 'michelangelo')\n",
    "analogy('beer', 'barley', 'wine')\n",
    "analogy('earth', 'moon', 'sun') # Interesting failure mode\n",
    "analogy('house', 'roof', 'castle')\n",
    "analogy('building', 'architect', 'software')\n",
    "analogy('boston', 'bruins', 'phoenix')\n",
    "analogy('good', 'heaven', 'bad')\n",
    "analogy('jordan', 'basketball', 'woods')"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
